(54) Parallel pipelined image rendering system

(57) A rendering system for rendering graphic data
includes a host processor generating data transfer and
rendering commands, and a memory storing the data
transfer and rendering commands in command queues.
A command parser of the system concurrently parses
and processes the data transfer and rendering commands
of the command queues, and a synchronization mechanism
is used to synchronize the concurrent parsing and
processing of the data transfer and rendering commands.
Description

Field of the Invention 

[0001] The present invention is related to the field of
computer graphics, and in particular to rendering graphic
data in a parallel pipelined rendering system.

Background of the Invention 

[0002] Volume rendering is often used in computer 
graphics applications when three-dimensional data 
need to be visualized. The volume data can be scans of 
physical or medical objects, or atmospheric, geophysi- 
cal, or other scientific models where visualization of the 
data facilitates an understanding of the underlying
real-world structures represented by the data.
[0003] With volume rendering, the internal structure,
as well as the external surface features of physical
objects and models, are visualized. Voxels are usually the
fundamental data items used in volume rendering. A 
voxel is associated with a particular three-dimensional 
portion of the object or model. The coordinates (x, y, z) 
of each voxel map the voxel to a position within the rep- 
resented object or model. 

[0004] A voxel represents one or more values related 
to a particular location in the object or model. In a prior 
art volume rendering system, the values represented by 
a voxel can be a specific one of a number of different
parameters, such as density, tissue type, elasticity, or
velocity. During rendering, the voxels are converted to
color and opacity (RGBα) values, according to their
intensity values, which can be projected onto a
two-dimensional image plane for viewing.
[0005] One frequently used technique during rendering
is ray-casting. There, a set of imaginary rays is cast
through the array of voxels. The rays originate from
some view point or image plane. The intensity values of 
the voxels are sampled at points along the rays, and var- 
ious techniques are known to convert the sampled val- 
ues to pixel values. 

[0006] U.S. Patent Application Sn. 09/315,742, "Volume
rendering integrated circuit," filed on May 20, 1999
by Burgess et al., is incorporated herein by reference.
There, a simple prior art volume rendering system is
described. The rendering system includes a host processor
connected to a volume graphics board (VGB) by an
interconnect bus. The host processor can be any sort of
personal computer or workstation including the bus. The
host includes a host or main memory. The host memory
can be any combination of internal and external storage
available to the processor, such as a main memory, a
cache memory, and a disk memory.
[0007] The VGB includes a voxel memory and a pixel
memory connected to a Volume Rendering Chip (VRC).
The VRC includes all logic necessary for performing
real-time interactive volume rendering operations. The
VRC includes four interconnected rendering pipelines.
In effect the VGB provides a rendering engine or
"graphics accelerator" for the host.

[0008] During operation, application software executing
in the host transfers the volume data to the VGB for
rendering. In particular, the voxel data are transferred
from the host memory over the bus to the voxel memory.
The application also stores other data, such as
classification tables, in the voxel memory. The application
also loads rendering registers accessible by the pipelines.
These registers specify how the rendering is to be
performed. After all data have been loaded, the application
generates a command to initiate the rendering operation.
The pipelines execute the rendering command.
When the rendering operation is complete, the output
image is moved from the pixel memory to the host or to
a 3D graphics card for display on an output device. In
the Burgess system, the major operations take place
sequentially, that is, write voxel data from the host to the
VGB, render in the VRC, and then write pixel data from
the VGB back to the host.

[0009] However, for more complex rendering operations,
it may be desired to overlap memory transfers with
rendering operations, or to set up one rendering opera- 
tion while a previous rendering operation is still in 
progress. Moreover, for efficiency reasons, it may be de- 
sirable to progress from one rendering operation to the 
next, without interrupt and without intervention by the 
application or other software executing in the host. Like- 
wise, it may be desirable to progress from completing a 
rendering operation to initiating a data transfer without 
interruption or intervention by software. 
[0010] For example, while the pipelines are rendering
a current volume, the image resulting from rendering a 
previous volume could be transferred back to the host, 
and a next volume to be rendered could be loaded. In 
this scenario, the rendering operation for the next vol- 
ume must not begin until the volume is completely load- 
ed, and the data transfer of the image must not begin 
until the rendering operation that produced it has com- 
pleted. This scenario quickly gets even more complicat- 
ed when the movement of data includes embedded pol- 
ygon geometry, because this increases the number of 
different data sources that must be loaded before ren- 
dering begins. 

[0011] Therefore, it is desired to provide an efficient
means for controlling the rendering system so that
activities of long duration, such as rendering operations
and data transfers, can be overlapped in time, so that
separate activities can be synchronized with each other
without intervention by host software, and so that the
host software can determine the state of rendering
activities and synchronize with activities in progress.

Summary of the Invention 

[0012] The invention provides a rendering system for
rendering graphic data that includes a host processor
generating data transfer and rendering commands. The
system also includes a memory storing the data transfer
and rendering commands in a plurality of command
queues, and a command parser concurrently parsing
and processing the data transfer and rendering commands
of the plurality of command queues. A synchronization
mechanism is used to synchronize the concurrent
parsing and processing of the data transfer and
rendering commands.

Brief Description of the Drawings
[0013] 

Figure 1 is a block diagram of a rendering engine
that uses command queues according to the invention;

Figure 2 is a block diagram of bus logic including 
command queue logic; 

Figure 3 is a block diagram of a preferred embodiment
implementing command queues according to the
invention;

Figure 4 is a graph of command queue states and
state transitions;

Figure 5 is a table of queue control registers;

Figure 6 is a table of queue status registers;

Figure 7 is a table of a queue command header; and 

Figure 8 is a table of command queue registers 
used for synchronization. 

Detailed Description of the Preferred Embodiment

Pipeline Organization 

[0014] Figure 1 shows the overall organization of a
volume rendering system according to our invention.
The system includes a host computer 10 connected to
a rendering subsystem 100 by a bus 121. The host
includes a CPU 11 and a host memory 12. The host memory
12 can store graphical application software 13 and
graphic drivers 14 that communicate with operating system
software. The software executes in the CPU 11. The
host memory can also store rendering data, such as
volume data sets, images, and various tables.
[0015] The main modules of the rendering engine 100
are a memory interface 110, bus logic 200, a sequencer
130, and four parallel pipelines 140. Except for a pair of
slice buffers 150, which span all four pipelines, the
pipelines (A, B, C, and D) operate independently of each
other. In the preferred embodiment, all of the main modules
110, 200, 130, 140, and 150 are implemented with
hardware circuits on a single VLSI chip. It should be noted
that this is only one example of a type of rendering
engine that can use the invention. The rendering engine
can also be a polygon rendering engine, pipelined or
not, or any other type of hardware-implemented
rendering engine.

Memory Interface 

[0016] The memory interface 110 controls a rendering
memory 160. In the preferred embodiment, the rendering
memory 160 comprises eight double data rate
(DDR) synchronous DRAM modules. The rendering
memory provides a unified storage for all rendering data
111 directly needed for rendering graphic objects in the
pipelines 140, such as volumes (voxels), polygons, input
and output images (pixels), depth values, and look-up
tables. As described below, the rendering memory
160 can also store command queue buffers 190 according
to the invention. Alternatively, the host memory
12 can store command queue buffers 190. That is, any
individual command queue buffer can be stored either
in the rendering memory 160 or the host memory 12.
[0017] The memory interface 110 implements all
memory accesses to the rendering memory 160,
arbitrates the requests of the bus logic 200 and the
sequencer 130, and distributes data across the subsystem 100
and the rendering memory 160. As an advantage, the
high bandwidth memory accesses (reads and writes)
are overlapped with rendering operations.
[0018] In the preferred embodiment, the VLSI chip 100 and
rendering memory 160 are implemented as a single
board that can be plugged into the PCI bus 121.

Sequencer 

[0019] The sequencer 130 controls the rendering engine.
It determines what data to fetch from the memory,
dispatches that data to the four pipelines 140, sends
control information such as interpolation weights to the
individual pipelines at the right time, and receives output
data from rendering operations. The sequencer itself is
a set of finite state machines controlled by a large
number of registers. These are typically written by the
bus logic 200 in response to load register commands of
a particular command queue, but may also be written
directly by software on the host system via PCI accesses
to the bus logic. In either case, bus logic 200 writes
register values to sequencer 130 via a FIFO 170. This
enables the operation of bus logic 200 to be decoupled
from that of the sequencer 130 and allows the bus logic
and sequencer to operate with different cycle times.
[0020] Internally, the sequencer 130 maintains 
counters 131 needed to step through sample space one 
section at a time, to convert sample coordinates to per- 
muted voxel coordinates, and to generate control infor- 
mation needed by the stages of the four pipelines. The 
sequencer also provides the bus logic with status 171 
on rendering operations. The bus logic, in turn, provides 
the host 10 with the rendering status.

Bus Logic

[0021] Figure 2 shows the internal structure of the bus
logic 200. The bus logic 200 contains command queue
logic 300 and provides an interface to the host 10 via
bus 121, and to the memory interface 110 and the
sequencer 130 via the FIFO 170. If the host is a personal
computer (PC) or workstation, then the bus 121 can be
a 64-bit, 66 MHz PCI bus conforming to version 2.2
of the PCI specification.

[0022] For this purpose, the bus logic includes a DMA
interface (DmaIf) 210 that controls direct memory access
(DMA) operations for transferring data to (DmaIn)
and from (DmaOut) the rendering memory 160 via the
memory interface 110. The DMA operations are
burst-mode data transfers. The bus logic acts as a PCI bus
master for this purpose.

[0023] A target interface (TargetIf) 220 controls the
reading and writing of rendering registers 230. These 
accesses are direct reads and/or writes to and from in- 
dividual registers and individual locations in the memo- 
ry, initiated by the host 10 or by some other device on 
the PCI bus. 

[0024] The bus logic also includes memory control
and arbitration logic 240, which arbitrates DMA requests
and target register read/write requests. The logic 240 
also converts memory access operations into com- 
mands between the bus logic 200 and the memory in- 
terface 110. The bus logic also sends register values 
directly to the sequencer 130 for controlling rendering 
operations and receives status 171 back from the se- 
quencer 130. 

[0025] Finally, the bus logic contains three identical
copies of the command queue logic 300 (numbered 201,
202, and 203). The preferred embodiment supports
three command queues. Typically, the logic 201-203 are
dedicated to processing dmaIn commands, render
commands, and dmaOut commands, respectively, as
described below. In the preferred embodiment, the
command queue ring buffers 190 are stored in the rendering
memory 160 or the host memory 12. Alternatively, the
buffers could be part of the command queue logic 300.

Command Queue Logic 

[0026] As shown in Figure 2, the rendering system
supports three command queues 300: DmaIn 201,
Render 202, and DmaOut 203. In a typical usage, the
DmaIn command queue 201 controls the transfer of data
from the host memory 12 into the rendering memory
160. The Render command queue 202 renders the
rendering data 111 according to user-supplied parameters,
including copying graphics data from place to place within
rendering memory 160. The DmaOut command queue
203 controls the transfer of data from the rendering
memory 160 to the host memory 12. This data can be
images or partially processed rendering data. Each
command queue buffer 190 can reside either in host
memory 12 or in the rendering memory 160.
[0027] Figure 3 shows the logic 300 that implements
each of the command queues 201-203. There is one
copy of the command queue logic 300 for each command
queue. The logic 300 includes command queue
state registers 310, a DMA state machine 320, a parse
state machine 330, a register array 340, pointer logic
350, and status logic 360. Each command queue logic
also reports status information to a shared status register
600. Note that there is only one status register for all
three copies of the logic. The status register is one of
the rendering registers 230.

[0028] The state registers 310 specify the state of the
associated command queue, including the location of
the associated command queue buffer 190 in the
rendering memory 160 or the host memory 12, and the current
position in the buffer. Specifically, the registers are
Size 311, SubConsumer 312, SubBase 313, Base 314,
Consumer 315, Producer 316, and Scratch 317.

[0029] The data values stored in the command queue
buffers 190 control data transfers and rendering operations,
and allow for the synchronization of transfer and rendering
commands. Therefore, the buffers store commands
(operators) and data (operands) as elements c and d.
Application software, or other means, writes the elements
c and d sequentially to empty entries in the command
queue buffers 190. Different commands c may
have different amounts of data d associated with them,
or in some cases no data at all.
[0030] The DMA state machine 320 reads elements
from the command queue buffer 190 into local storage,
i.e., the register array 340, while the parse state machine
330 reads or "consumes" the elements. If the command
queue buffer 190 is empty, or if a command c being
processed requires synchronizing to an event that has not
occurred yet, then the associated process stops and
waits until more elements are written to the queue, or
the event occurs.

[0031] In the preferred embodiment, each command
queue buffer 190 is arranged as a linear array in either
the rendering memory 160, or some external memory in
the PCI address space, such as host memory 12.
Command c and data d elements are written and read by
incrementally stepping through the linear array. When
the end of the linear array is encountered, the stepping
wraps around to continue at the beginning in a circular
manner. In other words, the buffers 190 are operated as
circular buffers.

[0032] The command queue state registers 310 control
this process. Registers Base 314 and Size 311
describe the location and length of the associated queue.
Depending upon the setting of the least significant bit of
Base, the queue is either in external memory 12 or the
rendering memory 160. In particular, if the least significant
bit is 0, then the queue is in the PCI address space,
and if the least significant bit is 1, then the queue is in
the rendering memory 160.
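
The location test on Base can be illustrated with a minimal sketch. The function name and the returned labels are hypothetical; the text specifies only the meaning of the least significant bit and does not say how the remaining bits of Base are interpreted, so the sketch decodes the flag and nothing else.

```python
def queue_location(base: int) -> str:
    """Decode where a command queue buffer resides from the LSB of Base.

    Hypothetical helper: only the least significant bit is examined,
    as described in the text. The other bits are left uninterpreted.
    """
    # LSB == 1 -> rendering memory 160; LSB == 0 -> PCI address space
    return "rendering_memory" if base & 1 else "pci_address_space"
```
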

[0033] The Size register 311 has a value one less than
the actual number of entries in the queue, which is a
power of two. The Size register can then be used as a
mask with the queue pointers so that complex wraparound
address arithmetic is not required.
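
As an illustration of why a power-of-two length makes the mask sufficient, the following sketch derives a Size value and applies it to a pointer. The function names are hypothetical and the code models only the arithmetic, not the hardware.

```python
def make_size_register(num_entries: int) -> int:
    """Return the Size value (entries - 1) for a power-of-two queue length."""
    if num_entries <= 0 or num_entries & (num_entries - 1):
        raise ValueError("queue length must be a power of two")
    return num_entries - 1


def masked_index(pointer: int, size_reg: int) -> int:
    """Wrap a queue pointer into the buffer with a single AND, no compare."""
    return pointer & size_reg
```

For a queue of eight entries, Size is 7, so a pointer value of 8 maps back to entry 0 and a value of 13 maps to entry 5.
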
[0034] The Producer 316 and Consumer 315 registers
are indexes into the command queue, according to
the standard ring buffer model that is known in the art.
In particular, the Producer 316 register points to the first
free entry in the queue where new commands or data
elements can be written. The Consumer 315 register
points at the element currently being processed. Both
the Producer and Consumer registers are masked by
the Size register, so that their low order bits represent
relative indexes into the memory array representing the
queue and their high order bits are discarded. Because
of this masking, simple arithmetic can be used for
incrementing; no test for wraparound is needed.
[0035] When processing is complete for a command,
the pointer logic 350 increments Consumer 315 so that
it points to the next entry in the command queue. If its
masked value is equal to the masked value of Producer,
then the queue is empty and parsing temporarily stops
until new commands are written to the queue. Otherwise,
a next command c is read and its processing is
initiated.
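
The consumer-side step can be sketched as follows. This is a software model under stated assumptions, not the pointer logic 350 itself: the function name is hypothetical, and the `entries_used` parameter is an assumption standing in for a command header plus however many data elements it carried.

```python
def complete_command(consumer: int, producer: int, size_reg: int,
                     entries_used: int = 1):
    """Advance Consumer past a finished command and report emptiness.

    Returns (new_consumer, queue_empty). entries_used is an assumed
    parameter covering the header and any data elements of the command.
    """
    new_consumer = (consumer + entries_used) & size_reg
    # Empty when the masked Consumer has caught up with the masked Producer.
    queue_empty = new_consumer == (producer & size_reg)
    return new_consumer, queue_empty
```
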

[0036] Application software executing on the host 
processor can write commands and data to the queue 
while previously written commands are processed. The 
software does so by writing commands and data into the 
queue starting at the entry denoted by the masked value 
of Producer 316. When the writing is complete, the soft- 
ware atomically updates Producer by the number of el- 
ements written to the queue, so that Producer points to 
the next empty entry in the queue. 
[0037] After the last element in the command queue
is processed, the next element is written at the beginning
of the queue. Of course, the software must ensure
that Producer never "catches up" with Consumer, else
the software would overwrite commands that the parse
state machine 330 has not yet processed.
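
The producer-side discipline described above can be sketched as a small software model. The class and method names are hypothetical. Because the queue is treated as empty when the masked Producer equals the masked Consumer, the model keeps one entry unused so that a full queue is never mistaken for an empty one; that reserve is a consequence of the empty test, not something the text states explicitly.

```python
class CommandQueueModel:
    """Host-side model of one command queue ring buffer (illustrative only)."""

    def __init__(self, num_entries: int):
        if num_entries <= 0 or num_entries & (num_entries - 1):
            raise ValueError("queue length must be a power of two")
        self.size = num_entries - 1          # Size register: entries - 1
        self.buffer = [None] * num_entries
        self.producer = 0                    # written by software only
        self.consumer = 0                    # written by queue logic only

    def free_entries(self) -> int:
        used = (self.producer - self.consumer) & self.size
        return self.size - used              # one entry is always held back

    def write(self, elements) -> bool:
        """Write elements, then publish them with one Producer update."""
        if len(elements) > self.free_entries():
            return False                     # would catch up with Consumer
        index = self.producer
        for element in elements:
            self.buffer[index & self.size] = element
            index += 1
        self.producer = index & self.size    # single update after all writes
        return True
```

Updating Producer only after every element is in place mirrors the requirement that the parser never sees a partially written command.
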
[0038] Registers SubBase 313 and SubConsumer
312 support a command queue "subroutine" facility
described below. The Scratch register 317 associated with
each command queue is provided for the convenience
of software. This register can hold temporary values and
can be loaded, stored, incremented, and tested with
a "sync" command described below.
[0039] Note that Producer is updated only by software,
never by the command queue logic 300, except
in response to a reset of the command queue as defined
below. Consumer and SubConsumer are only updated
by the command queue logic, never by software. Moreover,
these registers are updated atomically. Therefore,
these registers may be read at any time without concern
for race conditions with respect to comparisons between
Producer and Consumer.

Other Registers for use with Command Queues 

[0040] Figure 8 lists a set of scratch and temporary
registers 800 for use by software in managing queues,
synchronizing rendering and DMA events, and managing
external memory. The registers 800 include Scratch
801, ScratchDouble 802, and Memory Management
803. Each of these registers can be loaded, stored,
and incremented by software or by hardware using the
commands loadReg and incrReg, which are described below.
In addition, the registers may be tested by the sync
command.

[0041] These registers are particularly intended to
support the management and transfer of pixel and depth
arrays containing embedded polygons generated by a
graphics board connected to the host processor. They
are loaded and incremented atomically by the commands
loadReg and incrReg, which are described below. Therefore,
command queues exchanging data and synchronizing
themselves with these registers need not be concerned
with race conditions between the updating and
testing.

States of Command Queues 

[0042] As shown in Figure 4, each command queue
can be in one of five states while parsing commands:
running 401, waiting for commands 402, waiting for sync
403, halted 404, or reset 405. State transitions are
shown as directed edges 410 between the states. For
example, the transitions labelled "HALT || RESET" indicate
that the transition occurs during a cycle in which
either or both (||) of the halt and reset bits of the
QueueControl register (see below) are true for that
particular command queue.

[0043] In the running state 401, the command queue
logic 300 reads commands and data from the command
queue, starting at the command pointed to by Consumer,
and parses and processes each command in turn.
[0044] In the waiting for commands state 402, command
parsing has temporarily stopped because the
masked value of Consumer has caught up with the
masked value of Producer. Command parsing, i.e., the
running state 401, resumes automatically following the
update of Producer. At that time new commands are
read from the queue and processed. Processing a data
transfer command means causing a DMA or register
transfer, and processing a render command means
causing a rendering operation in the rendering engine,
e.g., the pipelines 140.

[0045] In the waiting for sync state 403, command
parsing has temporarily stopped because a sync command,
see below, is waiting for the value of a register to
satisfy a test. Command parsing resumes automatically
following the update of the register so that the test of the
sync command is satisfied.
[0046] In the halted state 404, command parsing is
stopped, regardless of the values of Producer and Consumer.
The halted state occurs either as a result of writing
to the corresponding halt bit of the QueueControl
register 500, see below with respect to Figure 5, or as
a result of parsing a command with the halt or haltAll bit
set, see below. Parsing can only be resumed by clearing
the halt bit in the QueueControl register 500.
[0047] When a command queue is in the halted state
404, the entire state of the command queue is contained
in its register set 310. No hidden or internal state of the
command queue is preserved. If the queue is not empty,
then Consumer points to the next command to be
processed. When the halt bit is cleared, commands are
re-fetched from the buffer 190 into the local register array
340. This allows software to modify queue elements in
the command queue buffer 190 while the queue is halted
and ensures that the command queue logic 300
reads those modified elements into internal buffer 340
before restarting.

[0048] Finally, in the reset state 405, the command 
queue is being reset to a defined initial state. Removing 
the reset signals leaves the command queue in the halt- 
ed state. 
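
The five states and the transitions described above can be summarized in a simplified model. This is a sketch under stated assumptions rather than a transcription of Figure 4: the function and parameter names are hypothetical, the model is evaluated only at command boundaries (so a halt request is assumed to arrive after the current command completes), and it takes the halt bit as an input instead of modelling how that bit is set on leaving reset.

```python
from enum import Enum


class QueueState(Enum):
    RUNNING = 401
    WAITING_FOR_COMMANDS = 402
    WAITING_FOR_SYNC = 403
    HALTED = 404
    RESET = 405


def next_state(state: QueueState, *, halt: bool, reset: bool,
               empty: bool, sync_blocked: bool) -> QueueState:
    """Return the next queue state, evaluated at a command boundary."""
    if reset:
        return QueueState.RESET
    if state is QueueState.RESET:
        return QueueState.HALTED          # removing reset leaves the queue halted
    if halt:
        return QueueState.HALTED          # halt wins over both waiting states
    if empty:
        return QueueState.WAITING_FOR_COMMANDS  # masked Consumer == masked Producer
    if sync_blocked:
        return QueueState.WAITING_FOR_SYNC      # sync test not yet satisfied
    return QueueState.RUNNING
```
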

Queue Control Register 

[0049] As shown in Figure 5, a QueueControl register
500 controls the state of each of the command queues.
This register provides bits for resetting
each of the command queues, fields for halting the
command queues, and a bit for strictly controlling where
DMA commands can be processed. The bits include
strict 501, resetDmaOutQueue 502, resetRenderQueue
503, resetDmaInQueue 504, haltDmaOut 505,
haltRender 506, and haltDmaIn 507.

Resetting a Command Queue 

[0050] Writing a "one" to any of the three reset bits
502-504 causes the corresponding command queue to
enter the reset state 405. All of its registers 310 are set
to zero. Resetting a command queue interrupts parsing
and processing of commands in the queue.

Halting a Command Queue 

[0051] Writing a "one" to any of the three halt bits
505-507 causes the corresponding command queue to
enter the halted state 404 when processing of the current
command is complete. In particular, commands
whose processing takes a long time, such as DMA
transfers with long scatter-gather lists or loadReg commands
with long lists of registers, will run to completion
before the corresponding command queue comes to the
halted state. This allows software to single step the
queues during debug, for example.
[0052] If the queue is in the waiting state at the time
its halt bit is set, then the queue immediately enters the
halted state. If the waiting state was a result of the test
of a sync command not being satisfied, then the sync
command is interrupted and the Consumer pointer is set
so that the sync will be re-executed when the halt bit is
cleared.

[0053] Clearing the halt bit causes that command
queue to return to either the running or waiting state,
depending upon the values of its registers. In particular,
if SubBase is not zero, then the command pointed to by
SubConsumer is re-fetched and executed. If SubBase
is zero and Consumer is not the same as Producer, then
the command pointed to by Consumer is re-fetched and
executed. If Consumer and Producer point to the same
element in the command queue, then the queue is
deemed to be empty and it enters the waiting state.
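
The decision made when the halt bit is cleared can be sketched as a small function. The function name and the returned tuples are hypothetical, and comparing the masked values of Consumer and Producer is an assumption carried over from the empty test described earlier.

```python
def resume_after_halt(sub_base: int, sub_consumer: int,
                      consumer: int, producer: int, size_reg: int):
    """Decide what a queue does when its halt bit is cleared.

    Returns ("execute", register_name, pointer) or ("wait", None, None).
    """
    if sub_base != 0:
        # A "subroutine" is active: re-fetch from SubConsumer.
        return ("execute", "SubConsumer", sub_consumer)
    if (consumer & size_reg) != (producer & size_reg):
        # Main queue still holds commands: re-fetch from Consumer.
        return ("execute", "Consumer", consumer)
    # Consumer and Producer coincide: the queue is empty and waits.
    return ("wait", None, None)
```
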
[0054] If the command pointed to by Consumer or
SubConsumer is a sync command, then the command
is re-fetched just like any other command and its test is
re-evaluated. If the test is not satisfied, then the queue
enters the waiting state again.
[0055] Command queues may also be halted by com- 
mands contained within the queues. To summarize, a 
command queue can be halted in any of four ways. 

[0056] First, application software can set its halt bit in
the QueueControl register 500 to one. The command
queue will halt at the end of its current command but will
not start a new command. If the current command is a
sync and the test has not been satisfied, then the sync
command will be aborted. The halted bit in QueueStatus,
see below, indicates when the queue has actually
entered the halted state.

[0057] Second, a command queue may have the halt
bit set in one of its own commands. In this case, the
processing of the command will complete, the halt
bit is written to the corresponding entry of QueueControl,
and the command queue immediately enters the
halted state. If the command is a sync command, then
the halt bit is not written until the sync test is satisfied.
This is useful for single-stepping through commands
during debugging.

[0058] Third, some command queue may have the
haltAll bit set in one of its commands. In that case, when
the command containing the haltAll bit completes, ones
are written to all three halt bits in QueueControl. Its own
command queue will immediately enter the halted state,
just as if it had the halt bit set in its command instead of
haltAll. The other two queues behave as if software had
written a one to their halt bits, i.e., processing of their
current commands will complete before their queues enter
the halted state, but sync commands waiting for unsatisfied
tests will be interrupted. The halted bits of
QueueStatus 600, described below, indicate when the
respective command queue has actually entered the
halted state.

[0059] Fourth, software can set the corresponding re- 
set bit of QueueControl. In this case, the command 
queue halts after the reset bit is cleared. 
Strict and Lax Control of DMA Commands

[0060] The strict bit 501 of the QueueControl register
500 determines which command queues can process
which of the DMA commands.
[0061] When strict is set to one, dmaIn commands
may be processed only from the DmaIn queue, and
dmaOut commands may be processed only from the
DmaOut queue. Any violation of this rule causes the halt
bits of all three queues to be immediately set to one and
an interrupt to be generated. The violating command is
not processed, and Consumer or SubConsumer of the
queue containing the violation is left pointing to that
command. This enforces the typical usage of the three
queues and detects errors when that usage is violated.
[0062] When strict is set to zero, any command can
be parsed and processed in any queue. This is useful,
for example, during software development when the
rendering engine may be controlled with only one command
queue so there is no overlapped processing. It is
also useful when software divides up commands among
the three queues in a manner other than the typical division
of dmaIn commands, render commands, and
dmaOut commands as described above.
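
The strict and lax rules can be illustrated with a short sketch. The function name, the dictionary result, and the string labels for queues and commands are hypothetical conventions chosen for this example; only the rule itself comes from the text.

```python
ALL_QUEUES = ("DmaIn", "Render", "DmaOut")

# Under strict control each DMA command has exactly one permitted queue.
HOME_QUEUE = {"dmaIn": "DmaIn", "dmaOut": "DmaOut"}


def admit_command(strict: bool, queue: str, cmd: str) -> dict:
    """Model the strict-bit check applied before a command is processed."""
    home = HOME_QUEUE.get(cmd)            # None for non-DMA commands
    violation = strict and home is not None and home != queue
    if violation:
        # Command is not processed; all three halt bits set, interrupt raised.
        return {"processed": False, "halt_bits": set(ALL_QUEUES),
                "interrupt": True}
    return {"processed": True, "halt_bits": set(), "interrupt": False}
```
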


Command Queue Status 

[0063] Figure 6 shows a QueueStatus register 600.
The register 600 is maintained by the status logic
360. This register indicates the status of the command
queues and whether each has been halted. Each of the
three halted bits 604-606 indicates that the corresponding
command queue has actually entered its halted state
404.

[0064] The status of each command queue is continually
updated in corresponding two-bit status fields
601-603. When the queue is not halted, these fields
update continuously. When the queue halts, this status is
latched to indicate what the command queue was doing
at the time it halted. This status can be examined by
software, for example during debugging.
[0065] After the queue halts, a status of "executing"
means that processing a command was in progress at
the time the halt bit was set. A status of "waiting because
Producer = Consumer" means that no command was
being processed when the halt bit was set because the
queue was empty. A status of "waiting for sync test to
be satisfied" means that an unsatisfied sync command
was interrupted by the setting of the halt bit.

Command Queue Commands 

[0066] Figure 7 shows the format 700 for queue commands.
A command comprises a command header c
(operator), which can be followed by data (operand).
Typical actions during the processing of commands
are to load or store registers, test register values
for synchronization purposes, and to initiate DMA
transfers. To initiate a rendering operation from within a
command queue, rendering registers are loaded by one or
more loadReg commands.

[0067] Each command header contains the following
fields. An operator code in the cmd field 701 indicates
the kind of command to be processed. The operators
are enumerated in the subsections below, together with
the operand 705 for each operator code.
[0068] When the interrujpt field 702 is set, an inten-upt 
is signalled at the completion of processing this com- 
mand. 

[0069] When the haltAll field 703 is set, all three halt
bits in the QueueControl register 500 are set to one upon
the completion of the current command. This causes the
current command queue to enter the halted state with
its Consumer or SubConsumer register pointing to the
next command to be parsed and processed. The other
two command queues enter the halted state after
processing of their current commands completes.
[0070] When the halt field 704 is set, the current command
queue enters the halted state, with its halt bit in the
QueueControl register set to one, as soon as the command
processing is completed. Its Consumer or SubConsumer
register points to the next command in its
queue.
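
The header fields described above can be illustrated by packing and unpacking a header word. This is a minimal Python sketch under stated assumptions: the operator codes are those enumerated in Figure 7, but the bit positions and field widths (a 4-bit cmd field, single-bit flags, a 24-bit operand) are hypothetical, since the text names the fields without giving their layout. The names pack_header and unpack_header are likewise illustrative.

```python
from enum import IntEnum


class Cmd(IntEnum):
    """Operator codes as enumerated in Figure 7."""
    NOOP = 0
    DMA_OUT = 1
    DMA_IN = 2
    SYNC = 3
    LOAD_REG = 4
    STORE_REG = 5
    INCR_REG = 6
    SUBROUTINE = 7
    RETURN = 8


# Hypothetical field positions; the text names the fields but not their bits.
CMD_MASK = 0xF
INTERRUPT_BIT = 1 << 4
HALT_ALL_BIT = 1 << 5
HALT_BIT = 1 << 6
OPERAND_SHIFT = 8
OPERAND_MASK = 0xFFFFFF


def pack_header(cmd, operand=0, interrupt=False, halt_all=False, halt=False):
    """Build a header word from the fields named in Figure 7."""
    if not 0 <= operand <= OPERAND_MASK:
        raise ValueError("operand does not fit the assumed 24-bit field")
    word = int(cmd) & CMD_MASK
    if interrupt:
        word |= INTERRUPT_BIT
    if halt_all:
        word |= HALT_ALL_BIT
    if halt:
        word |= HALT_BIT
    return word | (operand << OPERAND_SHIFT)


def unpack_header(word):
    """Split a header word back into its fields."""
    return {
        "cmd": Cmd(word & CMD_MASK),
        "interrupt": bool(word & INTERRUPT_BIT),
        "haltAll": bool(word & HALT_ALL_BIT),
        "halt": bool(word & HALT_BIT),
        "operand": (word >> OPERAND_SHIFT) & OPERAND_MASK,
    }


if __name__ == "__main__":
    # A noop with the halt field set, e.g. to separate two command sequences.
    header = pack_header(Cmd.NOOP, halt=True)
    print(hex(header), unpack_header(header))
```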

Noop command 

[0071] A noop command does nothing and takes one 
clock cycle to complete. During the noop command, the 
interrupt, halt, and haltAll bits in the command header
are parsed as in any other command. This command 
can be used as a placeholder by the application soft- 
ware or to force a halt between two separate sequences 
of commands. The operand 705 is ignored. 

DMA commands 

[0072] Two DMA commands, dmaIn and dmaOut, are
provided for causing DMA transfers from within the bus
logic 200. Each DMA command has a count operand
705 in the command header, followed by a list of the
specified number of data elements within the queue.
Each data element contains a hostAddress, a
hostAddressIncr, and a length. Regardless of which DMA
command is invoked, the DMA transfer represented by the
hostAddress and length proceeds to completion.
[0073] Processing of a DMA command cannot be in- 
terrupted by writing a halt bit to the QueueControl reg- 
ister. Processing of the command completes before the 
queue enters the halted state. Processing of a DMA 
command is, however, interrupted when its command 
queue is reset. This interrupt occurs at the natural end 
of a single DMA transfer of length bytes, whether or not 
count has been exhausted. 
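
The difference between halting and resetting during a DMA command can be sketched as follows. This is a simplified Python model, not the hardware's logic: DmaElement and process_dma are hypothetical names, the actual data movement is omitted, and a reset request is represented by a callback that is polled at the natural end of each single transfer. A halt request is deliberately not an input, because a halt takes effect only after the whole command has completed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DmaElement:
    """One data element following a DMA command header."""
    host_address: int
    host_address_incr: int
    length: int


def process_dma(elements: List[DmaElement],
                reset_requested: Callable[[], bool]) -> int:
    """Process the element list; return the number of transfers completed.

    Each single transfer of `length` bytes always runs to completion.
    A reset is honoured only between transfers, whether or not the
    count has been exhausted.
    """
    completed = 0
    for _element in elements:
        # The transfer itself would happen here; it is never cut short.
        completed += 1
        if reset_requested():
            break
    return completed


if __name__ == "__main__":
    elements = [DmaElement(0x1000 * n, 0, 64) for n in range(1, 4)]
    print(process_dma(elements, lambda: False))  # no reset: all transfers run
```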

[0074] If the strict bit 501 is set in the QueueControl
register 500, then the dmaIn command may only be invoked
from the DmaIn queue, and the dmaOut command
may only be invoked from the DmaOut queue.

Sync command

[0075] As an advantage of the present invention, the 
synchronization (sync) command provides a flexible 
mechanism for synchronizing the multiple command 
queues with each other and with application software. 
Executing a sync command for a particular command 
queue causes the register specified by the operand 705 
to be compared (equal, greater than, less than, etc.) with 
the value in a testValue operand. The comparison rela- 
tion is specified by the test operand. The sync command 
itself comprises only its command header; it is not fol- 
lowed by any data elements. 

[0076] If the comparison is true, then the parsing of 
elements in the command queue continues with the next 
command. If the comparison is false, then the command 
queue is placed in the waiting state until the register
specified by reg is updated to cause the comparison to 
become true. That is, at some later time, when software 
or some other command queue updates the register, the 
test of the sync command is finally satisfied. Then the 
command queue containing the sync command chang- 
es state from waiting to running, and the next command 
is fetched for parsing and processing. 
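
The sync test just described amounts to a comparison that either lets the queue run or leaves it waiting. The following Python sketch illustrates this under stated assumptions: the relation names "eq", "gt" and "lt" are hypothetical stand-ins for the encoding of the test operand, which the text does not give, and the register file is modelled as a plain dictionary.

```python
import operator

# Relation names are hypothetical; the text lists "equal, greater than,
# less than, etc." without giving an encoding for the test operand.
TESTS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lt": operator.lt,
}


def sync_state(registers, reg, test, test_value):
    """Return the queue state that results from one attempt at a sync."""
    satisfied = TESTS[test](registers[reg], test_value)
    return "running" if satisfied else "waiting"


if __name__ == "__main__":
    registers = {"Scratch": 0}
    print(sync_state(registers, "Scratch", "eq", 1))  # test not yet satisfied
    registers["Scratch"] = 1  # software or another queue updates the register
    print(sync_state(registers, "Scratch", "eq", 1))
```

In this model the queue re-evaluates the test whenever the register changes; the update may come from software or from another command queue.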
[0077] The sync command is special with respect to
halting. If a command queue has an outstanding unsatisfied
sync command, i.e., a sync command which has
been attempted but whose test has not yet been satisfied,
and the halt bit in QueueControl for that command
queue is set to one by software or some other command
queue, then the sync command is immediately aborted.
The Consumer or SubConsumer register of that command
queue is set to point to the sync command, as if
it had never been processed. Then later, when the halt bit in
QueueControl is cleared, the sync command is re-fetched
from the command buffer 190 into register array
340 and tried again. This gives application software
some predictable control over sync during the halted
state. From the software point of view, sync either succeeds
or it is left in the state that it would have been in if
it had not yet been tried. No hidden state, such as the
comparison value, is preserved in a command queue
across a halt.
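
The interaction between an unsatisfied sync and the halt bit can be illustrated with a small model. This Python sketch is a simplification under stated assumptions: the class and attribute names are hypothetical, only an equality test is modelled, commands are tuples of (kind, register, value), and the register array 340 is reduced to a single cached command. It shows that a halt discards the cached sync so that the command is re-fetched, including any change software made to it in memory.

```python
class CommandQueue:
    """Toy model of one command queue with a one-entry fetch cache."""

    def __init__(self, commands, registers):
        self.commands = commands    # command buffer in memory
        self.registers = registers  # registers tested by sync
        self.consumer = 0           # index of the next command to process
        self.halt = False           # halt bit for this queue
        self.state = "running"
        self._fetched = None        # stands in for the register array

    def step(self):
        """Advance the queue by one attempt and return its state."""
        if self.halt:
            # Abort any outstanding sync: drop the cached copy and leave
            # Consumer pointing at it, as if it had never been processed.
            self._fetched = None
            self.state = "halted"
            return self.state
        if self._fetched is None:
            if self.consumer >= len(self.commands):
                self.state = "waiting for commands"
                return self.state
            self._fetched = self.commands[self.consumer]
        kind, reg, value = self._fetched
        if kind == "sync" and self.registers[reg] != value:
            self.state = "waiting for sync"
            return self.state
        self._fetched = None
        self.consumer += 1
        self.state = "running"
        return self.state


if __name__ == "__main__":
    commands = [("sync", "Scratch", 1), ("noop", None, None)]
    queue = CommandQueue(commands, {"Scratch": 0})
    print(queue.step(), queue.consumer)
    queue.halt = True
    print(queue.step(), queue.consumer)
```

In this model, a command modified in memory while the queue is merely waiting is not seen, because the cached copy is still in use; after a halt and resume it is.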

[0078] The sync command also contains a refetch bit 
that causes similar behavior. In particular, if refetch is 
set to one, then the command queue buffer data in the 
register array 340 will be flushed and refetched, just as 
if it had halted at that point. This allows software to mod- 
ify the commands in the queue while the hardware is 
waiting for the sync test to be satisfied, and guarantees
that the modified commands will be read into the register 
array 340. This is useful, for example, for debug and for 
situations where a "dummy" value is written into a sub- 
sequent location in the command queue, which can only 
be filled in with the correct value after the specified sync 
condition is met. 



Register Commands 

[0079] The storeReg, loadReg, and incrReg commands
enable writing, reading and incrementing the
rendering engine registers. The registers' data can be
transferred to and from the PCI address space. The
storeReg command specifies a register address as its
operand, followed by a 64-bit PCI address as a data
element in the queue. The loadReg command allows
loading a variable number of contiguous registers as
specified by a register address and a mask in its operand,
followed by a data element in the queue for each bit that
is set in the mask, each data element representing a
64-bit register value. Finally, the incrReg command
contains a register address and an increment value in its
operand, which it adds to the specified register.
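
The relationship between the loadReg mask and its data elements can be made concrete with a short example. This is a Python sketch under an assumption the text does not state: that bit i of the mask selects the register at the base address plus i, and that data elements appear in ascending bit order. The function name expand_load_reg is hypothetical.

```python
def expand_load_reg(base_register, mask, data):
    """Pair each set mask bit with the next data element.

    Assumes bit i of the mask selects register base_register + i and
    that data elements are stored in ascending bit order.
    """
    targets = [base_register + bit
               for bit in range(mask.bit_length())
               if (mask >> bit) & 1]
    if len(targets) != len(data):
        raise ValueError("one data element is required per set mask bit")
    return dict(zip(targets, data))


if __name__ == "__main__":
    # Mask 0b1011 selects registers base+0, base+1 and base+3.
    print(expand_load_reg(0x10, 0b1011, [7, 8, 9]))
```

The number of data elements consumed from the queue therefore equals the number of bits set in the mask.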

Subroutine and Return 

[0080] The Subroutine and Return commands allow
processing of other queue commands that are not
"in-line," i.e., a "command subroutine." The subroutine
command causes processing of the command subroutine,
and then proceeds processing commands from the
queue in their normal order. A data element following
the subroutine command header is treated as a memory
address and is written to the SubBase register 313 of
the command queue invoking the subroutine. The return
address is held in the Consumer register 315, which
points to the element in the main command queue
following the Subroutine command and its data element.
Subroutine commands are stored as a list of elements
containing commands and data, just as for a command
queue buffer. However, the subroutine command queue
buffer is not a circular buffer. It is simply a linear list of
commands, terminated by a Return command.
[0081] In the preferred embodiment, any commands
may be placed in a subroutine command queue, except
the subroutine command itself. That is, subroutines do
not invoke other subroutines. However, in other embodiments,
subroutines could invoke other subroutines,
either by providing additional on-chip SubBase registers
or by storing data in off-chip memory.
[0082] Subroutines are most typically used for DMA
commands. This makes it convenient for application
software to keep the access information separate from
a template in the command queue buffer for certain
kinds of activity. A subroutine may be interrupted when
its command queue is placed in the halted state. When
the halt bit for that command queue is cleared, it is noted
that the SubBase register is not zero. The next command
is fetched from the element in the subroutine command
list denoted by SubConsumer, rather than from
the main command queue.
[0083] The Return command causes parsing to
resume from the main command queue at the entry
pointed to by Consumer. It clears both SubConsumer
and SubBase to zero, so that there is no confusion as
to where to restart command processing following a halt.
If the halt bit of a subroutine command is set, then the
command queue halts after the call of the subroutine but
before executing the first command of the subroutine.
When halted, SubBase points to the address of the command
following the subroutine command, SubConsumer
is zero, and Consumer points to the command to
resume following the return from the subroutine. If the
halt bit is set in a return command, then the command
queue halts after the return from the subroutine but
before proceeding with the first command following the
subroutine call. In this case, SubBase and SubConsumer
are cleared, and Consumer points to the command
to be processed next.
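
The pointer handling for Subroutine and Return can be traced with a small interpreter. This Python sketch rests on several assumptions that go beyond the text: memory is modelled as a dictionary from addresses to entries, each entry occupies one address, SubConsumer is treated as an offset from SubBase (consistent with it being zero immediately after a call), and a non-zero SubBase marks that a subroutine is active. The entry kinds "cmd", "subroutine" and "return" and the function name run are illustrative.

```python
def run(memory, start, end):
    """Process main-queue entries from start up to end; return a trace.

    Entries are ("cmd", label), ("subroutine",) followed by an address
    data element, or ("return",).
    """
    consumer, sub_base, sub_consumer = start, 0, 0
    trace = []
    while sub_base != 0 or consumer != end:
        in_sub = sub_base != 0
        addr = sub_base + sub_consumer if in_sub else consumer
        entry = memory[addr]
        kind = entry[0]
        if kind == "subroutine":
            if in_sub:
                raise ValueError("subroutines do not invoke other subroutines")
            sub_base = memory[addr + 1]  # data element: list address
            sub_consumer = 0
            consumer = addr + 2          # return address past the data element
        elif kind == "return":
            if not in_sub:
                raise ValueError("return outside a subroutine")
            sub_base = sub_consumer = 0  # resume from Consumer
        else:
            trace.append(entry[1])
            if in_sub:
                sub_consumer += 1
            else:
                consumer += 1
    return trace


if __name__ == "__main__":
    memory = {
        0x10: ("cmd", "a"),
        0x11: ("subroutine",),
        0x12: 0x40,              # address of the subroutine command list
        0x13: ("cmd", "d"),
        0x40: ("cmd", "b"),
        0x41: ("cmd", "c"),
        0x42: ("return",),
    }
    print(run(memory, 0x10, 0x14))
```

Because Consumer already holds the return address while the subroutine runs, a halt at any point leaves enough state in the three registers to resume without ambiguity.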

Conclusion 

[0084] Thus, the invention provides a command 
queue structure that separates input and output data 
transfers from rendering operations, and also enables 
synchronization between data transfers and rendering 
operations. This is an advantage when the transfers and 
operations take a long time, that is, many clock cycles. 
By separately handling these activities, operations that 
take a long time do not need to wait for one another,
except when specifically required for correct operation, 
e.g., delaying copying an image to system memory until 
rendering is complete. These synchronization delays do 
not require software intervention, but can be handled 
completely within the command queue control logic. 
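
The example of delaying an image copy until rendering is complete can be sketched as two queues coordinated through a register. This Python model is illustrative only: the queues are stepped round-robin, the "render" entry is a stand-in for the loadReg commands that start rendering and is assumed to complete within its step, the sync test is assumed to be "greater than or equal", and the register name Scratch and the function run_queues are hypothetical.

```python
def run_queues(queues, registers, max_rounds=100):
    """Step each queue in turn; return the order in which work completed."""
    pointers = {name: 0 for name in queues}
    log = []
    for _ in range(max_rounds):
        progressed = False
        for name, commands in queues.items():
            index = pointers[name]
            if index == len(commands):
                continue  # this queue is empty
            kind, *args = commands[index]
            if kind == "sync":
                reg, value = args
                if registers[reg] < value:
                    continue  # test not satisfied: the queue keeps waiting
            elif kind == "incrReg":
                reg, amount = args
                registers[reg] += amount
            else:
                log.append((name, kind))  # a transfer or rendering step
            pointers[name] = index + 1
            progressed = True
        if not progressed:
            break
    return log


if __name__ == "__main__":
    queues = {
        "DmaOut": [("sync", "Scratch", 1), ("dmaOut",)],
        "Render": [("render",), ("incrReg", "Scratch", 1)],
    }
    print(run_queues(queues, {"Scratch": 0}))
```

No software intervenes between the two queues in this model; the ordering follows entirely from the sync and incrReg commands placed in the queues.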
[0085] The command queue structure as described 
herein can be applied to any kind of graphics rendering 
engine, or any other hardware accelerator, for that mat- 
ter. The invention is especially advantageous when 
there are a large number of rendering activities that con- 
currently require host processing, sub-system process- 
ing, and high bandwidth DMA transfers. 
[0086] Although the invention has been described by 
way of examples of preferred embodiments, it is to be
understood that various other adaptations and modifi- 
cations can be made within the spirit and scope of the 
invention. Therefore, it is the object of the appended 
claims to cover all such variations and modifications as 
come within the true spirit and scope of the invention. 



Claims 

1. A rendering system, comprising:

a memory storing data transfer and rendering
commands in a plurality of command queues;

command queue logic for each of the plurality of
command queues, each command queue logic
concurrently fetching, parsing and processing
the data transfer and rendering commands of
a corresponding command queue; and

means for synchronizing the concurrent fetching,
parsing and processing of the data transfer
and rendering commands while rendering
graphic data.

2. The rendering system of claim 1 wherein sync commands
stored in the plurality of command queues
synchronize the plurality of command queues.

3. The rendering system of claim 2 wherein the sync
commands are generated by a host processor connected
to the rendering system.

4. The rendering system of claim 2 wherein the sync
commands are generated by a rendering pipeline
of the rendering system.

5. The rendering system of claim 1 wherein at least
one of the command queues is stored in a host
memory.

6. The rendering system of claim 1 wherein at least
one of the command queues is stored in a rendering
memory.

7. The rendering system of claim 1 wherein the plurality
of queues are arranged as circular buffers with
a plurality of elements to store the data transfer and
render commands.

8. The rendering system of claim 1 wherein each command
queue logic further comprises:

a DMA state machine for fetching the data
transfer and render commands from the corresponding
command queue; and

a parse state machine for parsing, processing
and synchronizing the data transfer and render
commands.

9. The rendering system of claim 1 wherein the data
transfer commands are issued to a DMA interface
and the rendering commands are issued to a plurality
of parallel pipelines.

10. The rendering system of claim 8 wherein the parse
state machine has reset, halted, running, waiting for
commands, and waiting for sync states.

11. The rendering system of claim 2 further comprising:

a scratch register for storing events, and wherein
the sync commands test the scratch register
for an occurrence of particular events.

12. The rendering system of claim 11 wherein parsing
and processing of the data transfer and rendering
commands is stopped while waiting for the occurrence
of the particular event.

13. The rendering system of claim 1 wherein each command
queue includes a plurality of entries for storing
the data transfer and rendering commands and
further comprising:

a producer register for each command queue
for storing a first pointer to a first entry storing
a next command to be fetched; and

a consumer register for each command queue
for storing a second pointer to a second entry
to store a next command.

14. A rendering system, comprising:

a memory configured to store data transfer, rendering
and sync commands in a plurality of
command queues;

a plurality of registers configured to store
events; and

command queue logic, for each of the plurality of
command queues, each command queue logic
for concurrently fetching, parsing and
processing the data transfer and rendering
commands of a corresponding command
queue, and synchronizing the concurrent fetching,
parsing and processing of the data transfer
and rendering commands by testing for particular
events stored in the plurality of registers
using the sync commands while rendering data.

15. A method for rendering, comprising:

storing data transfer and rendering commands
in a plurality of command queues;

concurrently fetching, parsing and processing
the data transfer and rendering commands of
a corresponding command queue; and

synchronizing the concurrent fetching, parsing
and processing of the data transfer and rendering
commands while rendering graphic data.

16. The method of claim 15 further comprising: 

storing events in registers; and 

testing the registers for occurrences of particular
events with sync commands to synchronize
the concurrent fetching, parsing and process- 
ing of the data transfer and rendering com- 
Ref  Field             Description

501  strict            Governs which queues may execute dmaIn
                       and dmaOut commands.
                       0: Lax: all commands valid in any queue.
                       1: Strict: dmaIn commands may be executed
                          only in DmaIn queue and dmaOut commands
                          may be executed only in DmaOut queue.

502  resetDmaOutQueue  Reset bits for the three command queues.
503  resetRenderQueue  0: Normal operation.
504  resetDmaInQueue   1: Reset.

505  haltDmaOut        Halt control for the three command queues.
506  haltRender        0: Clear a halted condition. Fetch next
507  haltDmaIn            command from memory (if Consumer ≠
                          Producer) and attempt to execute.
                       1: Request the queue to halt
                          the current command.

FIG. 5 (QueueControl register 500)
Ref  Field Name    Description

604  dmaOutHalted  Halt state of command queue:
605  renderHalted  0: The command queue is not in the halted state.
606  dmaInHalted   1: The command queue is in the halted state,
                      i.e., it has really halted.

601  dmaOutStatus  State of the command queue; latched upon halt.
602  renderStatus  0: executing.
603  dmaInStatus   1: waiting because Producer = Consumer.
                   2: waiting for sync test to be satisfied.

FIG. 6 (QueueStatus register 600)
Ref  Field Name  Description

701  cmd         Command to be executed:
                 8: return from subroutine call
                 7: subroutine
                 6: incrReg
                 5: storeReg
                 4: loadReg
                 3: sync
                 2: dmaIn
                 1: dmaOut
                 0: noop

702  interrupt   When this command has completed, post
                 an interrupt

703  haltAll     When this command completes, halt all
                 three queues

704  halt        When this command completes, halt this
                 queue

705  operands    Specific interpretation is dependent on
                 cmd

FIG. 7 (command format 700)
Ref  Register          Type  Description

801  Scratch           R/W   Registers for software use

802  ScratchDouble     R/W   Long registers for software use

803  MemoryManagement  R/W   Synchronization registers dedicated to
                             managing buffers that exist
                             somewhere in external PCI memory.

FIG. 8 (registers 800)